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Abstract 



Hutter ( 20071 ) recently introduced the loss rank principle (LoRP) as a general- 



purpose principle for model selection. The LoRP enjoys many attractive prop- 
erties and deserves furth er investigations. The L oRP has been well-studied for 
regression framework in iHutter and TrarJ d20ich . In this paper, we study the 



LoRP for classification framework, and develop it further for model selection 
problems in unsupervised learning where the main interest is to describe the 
associations between input measurements, like cluster analysis or graphical 
modelling. Theoretical properties and simulation studies are presented. 
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1 Introduction 



Model selection. Model selection is an important problem in machine learn- 
ing and statistics. Typically, model selection can be regarded as the question of 
choosing the right model complexity. The maximum likelihood principle (MLP) 
breaks down when one has to select among a set of nested models, because then 
the MLP always selects the biggest model (w.r.t. inclusion). Overfitting is a serious 
problem in structural learning from data. Much effort has been put into develop- 
ing model selec tion cri t eria t hat can avoi d overfit t ing. T he most popular ones ar e 



probably AIC ( Akai 



the MDL (IRissanen 



jjl973h . the BIC (ISchwarj : |l97^ ) the C v (iMallowsl. jl973|) 



19781 ). cross-validation ( Allen] . 1974 ; Craven and Wahba 



and c riteria based on Kademacn er complexit i es (iKoltcnmsknl. l200ll; iriartlett et al.l . 
2002 ). The reader is referred to Shao (11996 ); Hutter and TranT^OlO ) for compar- 
is ons of /s o me of t hese criteria. The loss r ank principle (LoRP) introduced recently 



1979) 



Bartlett et al 



m iHutterl (120071 ); iHutter and Tranl ( 120101 ) is another contribution to the model se- 
lection literature. The LoRP, as it is named, is a general-purpose principle for model 
selection rather than a specific criterion. The LoRP can be regarded as a guiding 
principle for deriving model selection criteria that can avoid overfitting. It has the 
advantage of always giving answers, even in cases where the other criteria can not 
be used. 

The loss rank principle. Consider the problem of selecting a model among a 
given set of models Ai achieving some kind of optimality properties. The main goal 
of the LoRP is to establish a selection criterion that is able to specify a parsimonious 
model that fits the data not too bad. General speaking, the LoRP consists in the 
so-called loss rank of a model defined as the number of other (fictitious) data that 
fit the model better than the actual data, and the model selected is the one with 
the smallest loss rank. 

bri efly 



We 



now 



present the LoRP developed in IHutterl (120071 ) and 



Hutter and Tranl ( 120101 ) for supervised learning settings. In supervised learning, the 
data is categorized into input and output, and the main interest is to develop a model 
for predicting output based on input. Let D = (x,y) = {(xi,yi) .. ,(x n ,y n )} £ (X xy) n 
be the (actual) training data set with x = (xi,...,x n ) are inputs and y = (yx,...,y n ) 
are (disturbed) outputs. Suppose that we use a model MeM to fit the data D, 
e.g., M is a linear regression model with d covariates, or M is a A;- nearest neigh- 
bors regression model. Imagine that in experiment situations we can conduct the 
experiment many times with fixed design points x. We then would get many other 
(fictitious) outputs y'. Let Lossm(v\x) be the empirical loss associated with a cer- 
tain loss function when using a model Mg A4 to fit the data set {x,y). For instance, 
Loss m{v\ x ) can be the least squares error, or the negative maximum log-likelihood 
when a sampling distribution is assumed. The loss rank of model M then is defined 
as 

LR(M\D) := fj, {y' G y n : Lo^ M (y'\x) < Loss M (y\x)} (1) 
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with some measure fi on y n . For example, fi can be the counting measure if y is 
discrete, the usual Lebesgue measure on M n if y = M, or some empirical probability 
measure (see Sections below). Intuitively, the loss rank is large for too flexible 
models fitting D = (x,y) well and also for too rigid models that fit D not well (in 
both cases the model fits many other D'=(x,y') better). For example, consider the 
polynomial regression problem where (£j,2/i) GiR 2 , a higher order polynomial would 
fit D well and also fit many o ther data D' well, thus result ing in a large loss rank. It 
was argued in iHutterl (120071 ) and iHutter and Tranl (120 lOf ) that minimizing the loss 
rank is a suitable model selection criterion which trades off the quality of fit with 
the model flexibility. 

The LoRP has bee n well studied for regression with continuous response 
( IHutter and Tranl . 120101 ) . With continuous data and under squared loss, the loss 
rank has a closed form and many optimality properties of the LoRP have been 
pointed out. For example, the LoRP (i) is model selection consistent in some spe- 
cial cases; (ii) reduces to Bayesian model selection in linear basis function regression 
with Gaussian prior; (iii) h as a minimum descriptio n length interpretation (inter- 
ested readers are referred to Hutter and Tran ( 2010l ) for the details). Furthermore, 
the LoRP in super vised l earni ng settings has been proven efficient in some spe- 
cific applications. iTranl ( 120091 ) demonstrated the use of LoRP for selecting the 



ridge parameter in ridge regression, while it was shown in ITranl (120101) th at shrink- 
age parameters in regularization procedures like Lasso (ITibshiranil . Il996l ) or SCAD 
( jFan and Lil . 1200 ll ) selected by the LoRP enjoy good statistical properties. 

The LoRP seems to be a promising principle with a lot of potential, leading 
to a rich field. We would like to emphasize that the LoRP should be regarded as 
a guiding principle which in specific applications helps to derive model selection 
criteria that can avoid overfitting. This paper continues our investigation of the 
LoRP as a general-purpose procedure for model selection. We first study the LoRP 
for classification framework where the response is discrete. Based on the LoRP, we 
derive a model selection criterion for classification and show that minimizing the 
criterion is asymptotically equivalent to minimizing an ideal criterion which is only 
known when the population distribution is known. 

Second, we develop the LoRP for model selection in unsupervised learning set- 
tings. This unsupervised learning LoRP then is studied by means of simulation 
in two specific applications: selection of number of clusters in cluster analysis and 
model selection in graphical modelling. The simulation shows that the model se- 
lection criteria derived from the LoRP work well and are competitive to existing 
ones. 

We end this introduction section by listing some attractive properties of the 
LoRP. The LoRP 

• always gives answers; 

• does not require insight into the inner structure of the problem; 

• does not require any explicit setting of the stochastic noise structure, i.e. no 
assumption of sampling distribution is needed; 
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• would work with any loss function. 



2 Model Selection by Loss Rank for Classification 

We consider in this section the model selection problem in a (binary) classification 
framework. Let D = {(Xi,Yi),...,(X n ,Y n )} be n independent realizations of random 
variables (X,Y), where X takes on values in some space X and Y is a {0,l}-valued 
random variable. We assume that these pairs are defined on a probability space 
(fi,£,P) with Q = (Xxy) n . We are interested in constructing a predictor t: X — ?- {0,1} 
that predicts Y based on X. The performance of the predictor t is ideally measured 
by the prediction loss 

P 7 (f) = Pfe t( x)) = P(Y * *(*)) (2) 

where j(t)(x,y) :=I y ^ t (x) is called the contrast function. Hereafter, for a measure \x 
and a /x-integrable function /, we denote the integral f fd/j, by fif or 

Ideally, we want to seek an optimal predictor s that minimizes P^y(t) over all 
measurable t:X— ^{0,1}. However, finding such a predictor is impossible in practice 
because the class of all measurable functions t:X— >{0,1} is huge and typically not 
specified. Instead, we may restrict to some small class of predictors J- ' . A question 
arises immediately here: how small should the class T be? A too small J 7 may 
lead to an unreasonable prediction loss, while finding an optimizer in a too large 
T may be an impossible task. Therefore the class/model T itself must be selected 
as well (the terms class and model will be used interchangeably). In this paper, we 
are interested in the model selection problem in which we would like to find a good 
model (in a sense specified later on) in a given set of models {J- m , me M.}. 

The unknown prediction loss (j2j) is often estimated by the empirical risk 

n 

Pnl(t)= 1 -J2 I Y^) ( 3 ) 
1 

where P n is the empirical measure based on data D 

n 

P ™ = n 

1 

with 5 X denotes the Dirac measure at x. For a class J- m , one may seek a function 
t m minimizing P n / y(t) over t G T m . Unfortunately, it is well-known that such a 
method leads to overfitting: the larger T m , the smaller the empirical risk P n 7(t m ). 
Consequently, the selected model is always the biggest one. This leads to the idea of 
accounting for the model complexity, in which we select a model m that minimizes 
the sum of the empirical risk and a penalty term taking the model complexity into 
account. 
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Because P„7(t) underestimates P7(£), a well-known regularized criterion for 
model selection is to penali ze the approximati o n on T m . of the predic t ion lo ss by 



moaei selection is to penali ze tne approximati o n on j- m 01 tne predic t ion lo s; 
the empirical risk (see, e.g., iKoltchinskii] (l200lh : IFromontl ( l2007h : krlotl (120081 )) 



critnfm) 



+ sup (P 



Pn)7(*)- 



(4) 



The second term, denoted by pen n (m), is a natural measure of the complexity of 
class J- m , which measures the accuracy of empirical approximation on class J- m . 
Then, the model to be selected is m n = argmin m {crit n (m)}. For simplicity, we 
assume throughout the paper that m n is uniquely determined. 

In practice, P is unknown and so is pen n (m). One has to estimate 
pen n (m). Many meth ods have been proposed to e stimate this theoretical 
pena lty: VC-dimension ( Vapnik and Chervonenkisl . 1971 ). Radem acher complexi- 
ties (Koltchinskiil . l2001t iBartlett et al.l . 120021 ) . resampling penalties (IFromontl . 12007 ; 
Arlotl . 120081 ). All of these methods give upper bounds for pen n (m). The perfor- 
mances of the methods are measured in terms of oracle inequalities. The sharper 
the estimate is, the better the performance is. These methods often works well in 
practice but are not without problems. For example, the VC-dimension is often 
unknown and needs to be estimated by another upper bound, Rademacher com- 
plexities are of t en cr i ticized to be too l arge (the local Rademacher complexities 



(IBartlett et al.l . 120051 ; iKoltchinskiil . 120061 ) have been introduced to overcome this 



drawback, however the latter still suffer from the hard-calibration problem because 
they involve unknown constants). 

In this section, based on the LoRP, we propose a criterion to estimate the model 
m n directly, not pen n . Instead of giving an upper bound for pen n (m), we directly 
estimate m n by minimizing a criterion over models mEAi. Minimizing the criterion 
is asymptotically equivalent to minimizing crit n (m) with probability 1 (Theorem [T]). 

In Section 12.11 the suggested criterion is derived and its model consistency is 
proven. In Section I2.2[ we discuss the implementation and carry out a numerical 



example to demonstrate the criterion and compare it to other methods. 



2.1 The loss rank criterion 

The LoRP, as it is named, is a guiding principle rather than a specific selection 
criterion. When it comes to apply in a specific context, a suitable choice of measure 
H in (JTJ) is needed. For continuous data cases, using the usual Lebesgue measure 



in M n leads to a closed form of loss rank and meaningful results ( IHutter and Tran 



20101 ). In our current context of the binary classification, some suitable probability 
measure on 3^ n = {0,l} n should be used to define the loss rank. To formalize this, we 
define the loss rank of a model as the probability that a randomly resampled sample 
fit the model better than the actual sample. This definition of the loss rank makes 
it not only possible to estimate the loss rank but also makes use of the available 
theory of resampling to justify the method. 
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We now formally define the loss rank. Let ri, i = l,...,n be n independent 
Rademacher random variables, i.e. takes on values either —1 or 1 with prob- 
ability 1/2. The rj's are assumed to be independent of D. Let Y( := -4p — r^, 
i.e. we flip the value/label of Yi with probability 1/2. The loss rank of model m is 
defined as 

« 

LR n (m) = LR n (J- m ) := P R ( inf ^Iyf^x,) < P n y(i m )\D) (5) 

where Pr(.\D) means the conditional probability w.r.t. the Rademacher sequence 
given data D. Intuitively, the empirical risk based on the actual D would be small 
for a too flexible class T m , but many resamples D' would then also result in small 
empirical risk, which leads to a large loss rank LR n (m). Therefore, minimizing the 
loss rank helps avoid overfitting. Also, a too rigid T m fitting D not well would lead 
to a large loss rank as well. Thus, the loss rank defined in (jSJ) is a suitable criterion 
for model selection which trades off between the fit (empirical risk) and the model 
complexity. 

LR n (m) is directly estimable by a simple Monte Carlo algorithm (see the next 
section). Then the selected model will be TOLR=argmin mej v<LR n (m). We name this 
method the loss rank (LR) criterion. 

Optimality property. We now discuss the model consis tency of the LR criterion by 



using the modern theory of empirical processes (see, e.g.. Ivan der Vaart and Wellner 



( 119961 )). To avoid dealing with difficulties of non-measurability in empirical process 
theory, we as usual assume that for each m<EA4, class T m is countable. We need 
the following regularity condition: 

(C) T> m = { r y(t),tEJ r m }, m£j\A are Donsker classes. 

Recall that a function class V is called a Donsker class if y/n(P n — P)/ converges in 
probability to iV(0,P(/ — P/) 2 ) uniformly in /gP. This, together with another con- 
dition that P(sup /eJ -|/-P/| 2 ) < oo (which is automatically satisfied in our context 
because j(t) < 1 for every predi ctor t) are essential in order for th e weak convergence 



of empirical processes to hold ( Ivan der Vaart and Wellnerl . Il996l . Chapter 3). These 



are al so two essential condit ions in or der for Efron's bootstrap to be asy mptotically 
valid ( IGine and Zinnl . Il990l ) (see also Ivan der Vaart and Wellner (jl996h ). 



Theorem 1. Under Assumption (C), minimizing LR n (m) over m & M. is asymp- 
totically equivalent to minimizing the ideal criterion crit n (m) with probability 1, i.e. 
rriLR is a strong consistent estimate of m n . 

On one hand, LR criterion is closely related t o penali z ed m odel selection based 



on Rademacher complexities. As being realized by lLozanol (120001 ). a very large model 
which generally contains a predictor predicting correctly most of randomly generated 
labels results in a large Rademacher penalty. While a very large model will result 
in a large loss rank which is defined as the probability that a randomly relabeled 
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sample behaves better than the actual sample. On the other hand, LR criterion is 
quite different from Rademacher complexities model selection. While Rademacher 
complexities give upper bounds for the ideal penalty pen n (m), LR criterion offers a 
way to directly estimate the ideal model m n . 

Proof of the theorem. By Y- := "4p — r^, it's easy to see that IyfjH(Xi) — In=i~ 
nlYi&iXi), therefore 

n n n 

i? f I J n'#(*<) = IH J ^=i - SU P r iH+t{x % y (6) 

" f, 

1 11 

Moreover, 

n n n 

1 11 

where := X^Wi^pQ,^) with Wi'.— l— rj~2Binomial(l,l/2) is the weighted boot- 
strap empirical measure. From ©-(IT]) and ([5]), we have 

n 

LR n (m) = pJ sup (P„ - P*) 7 (t) > ir/ n= i - P nl (t m )\D). 

The key point in the proof is the result of weak convergence of the weighted bootstrap 
empirical processes. The result states that, under Assumption (C), the difference 
between the conditional la w of Pn— Pff given data D and the law of P— P„ converges 
to zero almost surely (see ( van der Vaart and Wellnerl . 1996 . p. 346)). More formally, 
let G n = P n — P^ and G n = P — P n , and let l°°{V m ) be the space of all bounded 
functions from D m to the real set M (G n and G n are random elements in Z°°(D m )). 
Then 

\E R h(G n ) - Eh(G n )\ -»■ 0, P - almost surely 

for every continuous, bounded function h : l°° (T> m ) — > M. 

Therefore, by the continuous mapping theorem with notice that ~$^"-^-i=i — >l/2 
a.s., we have P-almost surely 

P jB (sup t6 ^ m (P B -P*) 7 (t) > ^El^=l-Pn7(U|^) 

-p(su Pt£Fm (p - p n ) 7 (0 > \ - p n7 (U) I -> 0. 

Thus, as n is sufficiently large 

LR„(m) = P (sup (P - P n )j(t) > | - P n j(i m ) J = P(crit„(m) > §) w.p.l. 
\teT m J 

For simplicity, suppose now that LR n (m) has a unique minimum at rh-LR. If 
m hR m n , P(crit n (m n ) > ~) > P(crit n (m LR ) >\). On the other hand, crit n (m„) < 
crit n (mLR) by the definition of m n , so P(crit n (m n ) > |) < P(crit n (mLR) > \). The 
contradiction implies rhi 1 n = m n w.p.l. ■ 
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2.2 Implementation and Simulation 



Implementation. The loss rank LR n (m) can be easily estimated by a simple Monte 
Carlo algorithm as follows: 

1. LR„(m)<-0. 

2. Toss a fair coin n times and define 



Y' 



head occurs at i-th time 



1,2,. ..,ra. 



1 — Yi, tail occurs at i-th time 

If inft e ^ m iJ]r^Vt(Xi)<PnT(*m) then LR n (to) LR„ (to) +1/B. 

3. Repeat step 2, 5 times. 

The theoretical justification for this algorithm is the law of large numbers: 
LR n (m) — y LR n (m) a.s. as B — > oo. In the following simulation, B is taken to 
be 200. From our experience, the results do not change much if a larger B is used. 

A numerical example. We now demonstrate the method by a simple example of 
a piecewise constant classifier with 2 m segments, and compare it to model selection 
based on Rademacher co mplexities . Cons ider the int ervals model se l ection problem 



which was described by iFromontl ( 120071 ) (see also, lLozanol ( 120001 ) ; iBartlett et al. 
(l2002t )). Given a number NeJN, let X = {1,2,...,2 N }. For u,veJN,u<v, denote by 
lN[u,v] the set of integers in interval [u,v]. For an integer number m, l<m<N, let 



T m = It: X ^ {0,1}, t = J2 c kl 



JV[(fc-l)2 JV - m +l,fc2 JV - m ] , 



c k e {0,1}, fc = i,...2 r 



k=l 



be the set of piecewise constant functions defined on X and taking on values {0,1} 
with possible jumps at k2 N ~ m , k = l,...,2 m — 1. 

For a given m , l<m <N, let S be the set of odd-numbered segments: 



S = \J N[(k - l)2 N ~ mo + 1, k2 N - mo }. 

fc=2p+l, p=0,li-i2 m o _1 - 1 

Let X be a uniformly distributed random variable on X and Y be a {0,l}-valued 
random variable defined as 



P(y = 1\X e S ) = \ + h, and P(y = 1\X i So) = \ - h 

where /i6(l,|) is called the margin parameter. We now have a model selection prob- 
lem with N candidate models {J- m , meM = {1,...,N}} and the optimal predictor 
s(x)=Is (x) G J- mo belongs to one of them. We are interested in identifying the true 
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First case: N=8, rn Q =2, h=0.1 



Second case: N=8, rn =4, h=0.2 



Figure 1: The plots of true functions and data for two cases. 

model mo- The advantage of the intervals model selection problem is that it is very 
easy to compute for each meM 

n n 

Pnl(im) = inf l^Iy^tix,) and sup ± ^ n^tpQ) . 



The reader is referred to iFromontl (120071 ) for the details. 



We compare LR crite rion to another criterion based on Rademacher complexities 



which is taken following IFromontl (120071 ) to be 



crit RC (m) = P„7(t m ) + pen RC (m) with pen RC (m) = E{ sup \ S ^r i I Yl ^t(x l )\D) 

ter m j=1 

We shall call this the Rademacher complexity (RC) criterion. In our experiment, 
Rademacher complexities pen RC (m) are estimated also by 200 Monte Carlo simula- 
tions. 

Figure [1] plots true functions and observation data (with n = 100) for two cases: 
first with N — 8, mo = 2, h—.l, then N = 8, mo = 4, h = .2. These pictures show how 
hard it is to decide intuitively what the true model is. Figure [2] plots LR criterion 
and RC criterion. Both criteria identify the true model in both cases. 

Table 1 presents the proportions of correct identification over 100 replications 
for each of 16 cases with various sample sizes n = 50, 100, 200, 300 and noise levels 
h = .05, .1, .2, .3 (m =4). It is shown that both criteria are model selection consistent 
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First case: N=8, m =2, h=0.1 Second case: N=8, m D =4, h=0.2 

0.7 | , , , , 1 , 1 1, , , , 1 r 




Complexity Complexity 



Figure 2: The plots of LR criterion and Rademacher complexity criterion. 



as the proportions increases to 1 as n and h increase. The simulation suggests that 
the LR criterion has an improvement over the RC criterion for large sample sizes. 



n 


h 


LR criterion 


RC criterion 


n 


h 


LR criterion 


RC criterion 


50 


.05 


.12 


.13 


200 


.05 


.23 


.21 




.1 


.35 


.35 




.1 


.67 


.66 




.2 


.62 


.64 




.2 


.99 


.97 




.3 


.95 


.97 




.3 


1 


1 


100 


.05 


.15 


.15 


300 


.05 


.30 


.28 




.1 


.41 


.41 




.1 


.78 


.76 




.2 


.89 


.90 




.2 


1 


.99 




.3 


.98 


.98 




.3 


1 


1 



Table 1: Proportions of correct identification of LR and RC criterion for various n 
and h. 



3 The LoRP for unsupervised learning 

The LoRP developed so far is for supervised learning settings only. In supervised 
learnings, there are measurements called inputs which are used to predict outputs. 
Note that, in such settings, we have fixed the inputs x in the definition of the 
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loss rank, and "resample" only the outputs y. This seems to have some physical 
interpretation in supervise d learnings and more im portantly leads to a closed form 
of loss rank in many cases (IHutter and Tranl . 120101 ) . Such a way is not applicable to 
unsupervised learning settings where there is no outputs. For example, in graphical 
modelling or cluster analysis, the question of interest is to explore the associations 
between a set of input measurements. Fortunately, the basic reasoning of LoRP 
can be straightly extended to unsupervised learning. It is worth recalling the key 
observation of the LoRP: too flexible models will fit the actual data well and also fit 
fictitious/resampling data well ("fitting well" here means "having a small empirical 
loss"). Let x — (xi,...,x n ) be the actual data set and LossA/(a;) be the empirical loss 
when fitting data a? by a model M. Assume that the empirical loss has the property 
that the more flexible M, the smaller Lossjv/(*)- Let x' be a resample from x using 
some resampling scheme (e.g., boostrapping). We can now define the loss rank of 
model M as 

#{£c' : Lossm(x') < Loss M (a;)}. 

This definition is easily understood intuitively but not very practical because the 
total number of resamples x' is often huge or infinite. To make it more practical, we 
can proceed as follows. Let B be the set of B resamples x' from x. The loss rank 
now can be defined as 



LR fl (M|aj) 



#{x' G B : Lossm(^') < Lossm(^)} 
B ' 



Mathem atically, let P„ be the empirical probability measure of th e resampling 



scheme ( lEfron and Tibshiranil . Il993l ; Ivan der Vaart and Wellnerl . I19961 ) , we formally 
define the loss rank as 



LR(M\x) = P n {x' : Lossm^') < Loss M (a;)}. 



(9) 



Clearly, the loss rank defined in (jHJ) is an estimate of the one defined in (|9]). In the 
next section, we will study the unsupervised Lo RP by means of simulation. The 
resampling scheme used is the popular bootstrap ( lEfron and Tibshiranil . Il993l ). 



4 Simulation studies for unsupervised LoRP 

In this section, the unsupervised LoRP will be applied to selecting good models in 
graphical modelling and selecting number of clusters in cluster analysis. 

4.1 LoRP for choosing number of clusters 

Cluster analysis ( Hastie et all 20051 Ch.14) is an important problem in unsupervised 



learning. The goal is to group a collection of objects into clusters such that objects 
within each cluster are more closely related to each other than objects assigned to 
different clusters. In some applications, the number of clusters K may be known 
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in advance but in most cases K is unknown and must be selected based on the 
data. Popular methods for m odel selection such a s AIC, BIC or coss-validation are 



not applicable here (see, e.g., lHastie et al.l (120051 ) . Ch.14). Let x = (xi,...,x n ) be n 



objects and d(xi,Xj) be the distance (or dissimilarity measure) between x» and Xj. 
Suppose that the n objects x have been clustered into K clusters Ci,...,Ck using 
some clustering algorithm (e.g., the K-means algorithm). The natural loss is the 
within- cluster sum of dissimilarities 



A" 



k=l i,j£C k 

When number of clusters K increases, Wk generally decreases. Let B be the set of 
B bootstrap resamples x' from the actual data x, we define the loss rank of using 
K clusters as in (JBJ) by 

4{x' e B : W K (x') < W K {x)} 



LR B (K\x) 



B 



The optimal K selected by the LoRP will be i^LR = argminA'LR,B(-R'|:E). 

A popular method i n the literature for selecting K is the criterion proposed in 



Calinski and Harabaszl (119741 ) 



CH(A) = W K /(n- K y 

and the K selected is the one maximizing this criterion. Note that the CH criterion 
is not defined for K—l. In the following simulation we compare the performance of 
the LoRP with that of the CH. 

Simulation. We generate 2-dimensional datasets with various settings: 

• 2 clusters, each with 50 observations, are generated from 2-dimensional normal 
distributions N(fi,al) with /i = (0,0), (0,5) and cr = 1,2,3. 

• 3 clusters, each with 50 observations, are generated from 2-dimensional normal 
distributions N(fi,al) with /i = (0,0), (0,5), (5,0) and a = 1,2,3. 

• 4 clusters, each with 50 observations, are generated from 2-dimensional normal 
distributions N(fi,crl) with yU=(0,0), (0,5), (5,0), (5,5) and a = 1,2,3. 

We measure the performance in terms of percentage that the true number of clusters 
is correctly identified, over 100 replications for each setting. The simulation result 
is summarized in Table |5J It seems hard to compare the performance of the two 
methods. While the CH outperforms the LoRP for "easy" cases (small a), the LoRP 
outperforms the CH for "hard" cases (large a). However, our main interest is not the 
quality of the LoRP in this particular example, but to show that the unsupervised 
LoRP developed above as a general-purpose principle works for selecting number of 
clusters. 
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Table 2: Percentages of correct identification over 100 replications. 



4.2 LoRP for graphical modelling 



We study in this section the unsupervised LoRP for structural learning in graphi- 
cal modelling, mainly focus on Marko y networks with discrete- valued vertices (also 
called graphical log-linear modelling) flWhittakerl . fl99~ol ; lEdwardsl . l2000h . but exactly 
the same idea would work for Bayesian networks as well. 

Graphical modelling. The basic idea of graphical modelling is to use graphs to 
represent the independence structure among a set of variables. A graph is a pair G— 
(V,E) where the vertex set V consists of a finite set of random variables and the edge 
set E represents the (conditional) independence relations between the r.v.'s in V . For 
every u,v^V, if u and v are conditionally independent given all the other variables 
in V then the edge (u,v) is not included in E. In other words, a non-adjacent pair of 
vertices can be immediately interpreted as being conditionally independent given the 
rest. Graphical modelling provides an efficient way to represent and communicate 
the conditional independence relations between a set of r.v.'s. We restrict ourselves 
to undirected graphs (also called Markov networks) in this paper, but the same idea 
can be directly adapted for directed ones (or Bayesian networks). 

Most of the literature on graphical modelling is concerned with selecting an ap- 
propriate model to explain the dat a. The most popular method is stepwise selection 
(IWhittakerl . Il990l : lEdwardsl . l2000h which starts at an initial base model and moves 
to next step by including or excluding a single edge until some termination criterion 
is fulfilled. Stepwise selection i s search-efficient but its ma i n dra wback is that it may 
get stuck in a local optimum ( Whittaker . 1990l ; Edwards . 2000l ). Furthermore, it is 
not easy to understand the statistical properties of the selected model. 

Another criterion can be used for grap hical model selection is the Bayesian in- 
formation criterion (BIC) (jSchwara . Il978l ). BIC of model G has the form 



BIC(G) = — log(maximum likelihood under G) + -(^free parameters of G) \ogn 

It is well-known that BIC is asymptotica lly able to identify the true model (if it 
exists). One may use AIC (lAkaikd . 1 19731 ) as a selection criterion as well, but AIC 
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tends to select overfitted models. AIC is optimal in terms of mean squared error loss 



(jShibital . ll984j ). however, this quantity is not well-defined in the graphical modelling 



context. 

Loss rank criterion. Let V be a set of k discrete variables, and for each v G V let 
X v be the set of its possible values/ levels. The dataset of size n is cross-classified 
by the levels of variables in V. Let Z= £g>„ g yZ„. The dataset is often conveniently 
given in the form of a contingency table with cell counts n = {ni} i£ x where rij is the 
observed number of observations cross-classified into cell i, Y2 n i = n - The sampling 
distribution of n is often assumed to be multinomial 

/is Tl\ T~tr m i\n 

p(n|m) = TW^S(-) ' 

where m={mj}j e j, rrii are the expected numbers of observations falling into cells % 
out of total n observations, ^i m « = n - 

Let {rhi(G,n)} be the MLE of m; under model G. We define the empirical 
loss function resulting from fitting data n by model G as the negative maximum 
log-likelihood (neglecting the constant terms depending only on n) 

Loss G (n) := [n* log(m;(G, n)) - log(n t !)] . 

i 

This empirical loss is not a suitable measure for model selection, because the larger 
the model G (w.r.t. inclusion), the smaller the loss. From fl9]), the loss rank of model 
G is 

LR n (G) := P n (Loss G (n') < Loss G (n)) (10) 



where P„ denotes the bootstrap empirical measure (lEfron and Tibshiranil . Il993l ) 
and n' = {ra-}j g j is a bootstrap resample from the actual data n. The graph to be 
selected will be Glr = argminLR n (G). We call this strategy the loss rank criterion 
for graphical model selection. 

Similar to the classification case, it is straightforward to estimate the loss rank 
(TTTJT) by a simple Monte Carlo algorithm. In the following simulation, we estimate 
LR„(G) by an average over 5 = 200 bootstrap resamples n' from n. From our own 
experience, the result does not change much if a larger number of replications is 
used. 

Note that definition fflQ|) is somewhat similar to definition ()5]) of the loss rank 
for classification. However, the proof technique in Theorem [1] seems not to apply 
here because the derivations ©-(IT]) in the proof are not valid anymore. Instead, 
in the following we will evaluate the suggested strategy by means of simulation. A 
theoretical justification is left for the future work. 

In order to help the reader grasp better how the LR criterion works, we first 
present here a simple example where an exhaustive search over model space is pos- 
sible. For the case of large number of vertices k, we will derive a genetic algorithm 
to overcome the difficulty in searching over huge model spaces. 
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A simple example. We consider a simple example where the number of vertices 
is k = 3, and each variable takes on 3 values/levels. The number of graphs then is 
2(2) =8. For a given sample size n, 100 datasets are generated from the "true" model 
with formula 12/23, i.e., the first and third variable are conditionally independent 
given the second. We evaluate the performance in terms of proportion of correct 
identification over 100 replications. Table [3] shows the performance of LR in com- 
parison to that of BIC. The simulation result suggests that LR is superior to BIC. 



n 


200 
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1000 


2000 


5000 


LR 


.2 


.7 


.9 


1 


1 


BIC 


.05 


.4 


.7 


.8 


1 



Table 3: Proportions of correct identification of LR and BIC for various n 



This result is similar to the simulation result in (iHutter and Tranl . |2010| . Table 1) in 
which it was also shown that the LoRP works better than BIC for model selection 
in linear regression. 

Graphical model selection with LR criterion and a genetic algorithm. The 

main difficulty in graphical model selection is that the number of models is increasing 
more than exponentially as the number of vertices increases. Model selection can 
be seen as a problem of searching for the optimal solution, w.r.t. a certain selection 
criteri on, over the model space. A natural choice is to adapt genetic algorithms 



(GA) (IHollandl . 1 19751 ; IMitchelll. I1996T) for searching o ver the model sp ace. Th i s idea 
has been already taken in ( iPoli and Roveratol . Il998l ) who used AIC (lAkaikd . Il973l ) 
as the selection criterion and proposed a genetic algorithm for model search. Here, 
we adapt their genetic algorithm for model search and use the LR criterion as the 
selection criterion. 



Genetic algorithms ( IHollandl . Il975l : IMitchelll . Il996l ) are widely used to search for 
optimization solutions when the solution space is huge. The basic idea of GA is to 
mimic the evolutionary processes of creatures in which they attempt to find better 
solutions to the given problem by generating successive generations of individuals 
that are expected to be better suited to the environment than their ancestors. 

Solutions are typically encoded by binary strings, called chromosomes. Chromo- 
somes are associated with a fitness function and the problem is to find the fittest 
individual. The search space consists of all possible chromosomes, which is typically 
infeasible to access every individuals. A GA starts by generating an initial pop- 
ulation and proceed by applying in turn three operators: selection, crossover and 
mutation. Selection operator randomly selects parents from the current population 
with probability being an increasing function of fitness to form a new population. At 
this stage, another operator called elitism may be used, in which a certain number of 
fittest individuals in the current population are directly inserted into the new pop- 
ulation. Offsprings are obtained by applying the crossover to pairs of parents with 
a probability of p c - a pre-fixed number in [0,1]. The crossover typically consists in 
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exchanging certain bits of two selected chromosomes. Finally, the new generation is 
obtained by applying with a probability of p m the mutation operator which changes 
one or more bits of a chromosome. The procedure is repeated until a termination 
criterion is satisfied. A widely-used termination criterion is that the fittest does 
not change for the last, say T, iterations. Some theoretical conditions to assure the 
convergence to the global optimal were introduced, however, applications do not 
always follow. Therefore, for the problem at hand, it is recommended to run the 
procedure several times before making the final decision of the selected fittest. 

The specific application of GA to graphical model search consists in how to 
encode graphs as binary strings and in defining the operators selection, crossover 
and mutation. 

Firstly, an undirected graph G=(V,E) with k vertices can be totally represented 
by a (strictly) upper triangular matrix M.(G) = (m^- ) J>i in which = 1 iff there 
is an edge between the z-th and j-th vertices. The matrix Ai(G) in turn can be 
identified with a binary string 13(G) in which the entry of M. is stored at 
the corresponding position (k — l)(z — l) + (j — 1)— i(i — l)/2 of B. For example, the 
true model with formula 12/23 in the previous example can be encoded by the 
binary string (1,0,1). The length of binary strings encoding graphs with k vertices 
is k(k-l)/2. 

The fitness function is inversely proportional to the loss rank: the fitter an indi- 
vidual, the smaller its loss rank. The fitness proportionate selection is not suitable 
in the present context, because some loss ranks may be very close or even equal to 
zero, which may cau se premature convergence. Therefore the linear ranking selec- 



tion (IMitchelll . 119961 ) should be used. This selection operator starts by sorting the 
individuals in the decreasing (equivalently, increasing) order of fitness (loss rank). 
Then the probability for the z-th individual in the ranking to be selected is 

ft = I( j 9-203-l)g), 0e[l,2]. 

It seems that there is no clear suggestion on selection of /3. In our simulation, /3 
is fixed to 1.5. We also apply the elitism in which 5% of the fittest individuals are 
kept for the next population before applyin g the selection operator. 



We now follow iPoli and Roveratol (119981 ) to define the crossover operator. For a 



pair of parents models G 1 and G 2 , a subset A of V is randomly selected and two 
offspring are formed by exchangi ng the induced subgraphs G\,G 2 A . The motivation 



of this operator is interpreted in lPoli and Roveratol (Il998[ ). The mutation operator 



consists in randomly selecting a bit in a binary string and change its value (0 to 1 
and vice versa). The probability of doing crossover and mutation are fixed to p c = .9 
and p m = 0.01 respectively. These values are chosen based on our own experience 
and in reference to others 

Some authors regard model selection as more than a machine learning or sta- 
tistical issue, it is a philosophical one! Whether or not the true model exists is a 
controversial issue; another one is that whether or not one should select a single 
model and do subsequent inferences conditional on that selected one. It would be 
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risky to select a single model, especially out of thousands as in the graphical mod- 
elling, and proceed as if it was the true one. Even if the true model exists, it is 
unrealistic to expect the GA to be always able to find. It is therefore more rea- 
sonable to restrict our expectation to finding a set of appropriate models instead 
of a single "best" one (which may turn out to be an inappropriate model!). The 
selected set then serves as the basic for a further context-specific consideration. The 
idea of selecting a set o f mode ls instead of a single model has also been discussed in 



Roverato and Paterlinil (120041 ) 



We now present the algorithm formally for searching for a set H of appropriate 
models. The maximum cardinality of such a set is pre-specified, say K. The basic 
idea is to repeat the GA procedure several times with different initial populations. 
We start with "H = 0. After each iteration (of the GA procedure), a fittest individual 
is selected. This individual will be added to the optimal set H if it was not previously 
selected. The overall procedure stops when either the cardinality of H reaches K 
or % does not change for the last, say J, iterations. The following is the GA-LR 
pseudo-co de for ou r proc edure, (the readers who are not familiar with GA are 



referred to iMitchelll (119961 ) for the terminology) . 



The GA-LR algorithm 

%: = 0, j :=0, resampling B resamples 
[ While \H\<K and j< J do 

Generate an initial population P 

Calculate the loss ranks for models in P and select the fittest G* 
t: = 

\ While t<T do 

apply elitism and selection on P to form new population Pi 
apply crossover on Pi to form P 2 
apply mutation on P 2 to form P 3 
P:=Pz 

Calculate the loss ranks for P and select the fittest G' 
If G' = G* then t:=t+l else G*: = G'; t: = 
[ end while 

If G*eU then j:=j+l else H = HU{G*}- j:=0 
[ end while 



A simulation study. We consider a moderate example with 6 vertices, each vertex 
takes on two values. The total number of graphs is 2w =32768. Datasets of size 
n = 10000 are generated from the "true" model 123/456. In our simulation, the 
parameters T and J are fixed to 5, the size of initial populations is fixed to 100. For 
a pre-specified maximum cardinality K, each run of the GA-LR algorithm produces 
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a set % of optimal models in which \H\<K. We are interested in whether or not the 
selected set % contains the true model. A small K may be preferred because it eases 
the subsequent analysis, but important models are more likely to be missed. We 
evaluate the performance of selection criteria (LR and BIC) in terms of proportions 
in which the selected set % covers the true model. Table H] shows those proportions 
over 10 replications for various K . From the simulation results, we draw the following 
conclusions: (i) The GA-BIC algorithm often terminates before the maximum K 
is reached, i.e., the GA-BIC is more stable than the GA-LR; (ii) In contrast, the 
GA-BIC misses the true model more often than the GA-LR. An obvious drawback 
of the GA-LR is its computational time. In this simulation, each run of the GA-LR 
requires approximately 30 minutes which is about 50 times more than the running 
time of the GA-BIC. 



Table 4: Proportions of correct coverage for various K over 10 replications 

Remarks on computation aspects. The implementation is written in R, bene- 
fited from the R package igraph of Gabor Csardi. The simulation was carried out on 
a CPU Intel 2.66GHz. The software is freely available upon contacting the authors. 

5 Conclusion 

We have presented in this paper our continuous investigation of the LoRP, a general- 
purpose principle for model selection. The efficiency of the LoRP for model selection 
in classification was shown theoretically and experimentally. We also developed the 
LoRP for model selection in unsupervised learning settings and studied it by a means 
of simulation. 

A fundamental question in model selection is that what kind of model one would 
like to select: the true model (i.e., the model generating the data) or a useful model 
in some sense (often, in terms of prediction) or a parsimonious model that fits the 
data not too bad. The LoRP attempts to deal with the latter which is, by common 
consent, the most appealing one in the machine learning community. Our objective 
in this paper is to draw the reader's attention to a new methodology for model 
selection that seems to have a lot of potential, leading to a rich field. 
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